Operational Practices For Quickly Identifying And Resolving Network Anomalies Across The US CN2 Network

2026-05-24 09:42:57

1.

Identify the scope and impact of the fault

- First, identify what is affected: a single client, a specific data center, or multiple regions.
- How to check: Run ping/traceroute tests from the affected client and from several healthy clients, and record the timestamps and the IP address of each hop. Example: ping -c 10 8.8.8.8 ; traceroute -n -I 8.8.8.8 …
- Purpose: Determine whether the packet loss or delay is one-way or affects both directions, and whether it originates at the edge access, in the CN2 backbone, or at the destination ISP.
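
The scoping step above can be sketched in code. This is a minimal illustration, not a definitive method: the `ProbeResult` type, the site labels, and the 5% loss threshold are all assumptions made for the example, and the loss figures would come from the ping runs described above.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    client: str      # hypothetical client label
    site: str        # data center or region the client sits in
    loss_pct: float  # packet loss reported by ping, 0-100


def classify_scope(results, loss_threshold=5.0):
    """Return 'none', 'single client', 'single site', or 'multiple sites'."""
    bad = [r for r in results if r.loss_pct >= loss_threshold]
    if not bad:
        return "none"
    bad_sites = {r.site for r in bad}
    if len(bad_sites) > 1:
        return "multiple sites"
    # All bad clients are in one site. Call it a site-wide fault only when
    # more than one client was probed there and every one of them is bad.
    site = next(iter(bad_sites))
    site_clients = [r for r in results if r.site == site]
    if len(site_clients) > 1 and len(bad) == len(site_clients):
        return "single site"
    return "single client"
```

With only one probed client in a site, the sketch cannot tell a client fault from a site fault and reports "single client", so probing at least two clients per site gives a more useful answer.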

2.

Basic connectivity and latency diagnosis (Ping/MTR/Traceroute)

- Use mtr to obtain the packet loss and latency distribution: mtr -rwzbc 100 <target IP> (Linux). The report lists the packet loss rate and average latency for each hop.
- Combine ICMP and TCP path checks: traceroute -I -n <target> (ICMP) and tcptraceroute <target> 443 (TCP). Compare the results to see whether firewall restrictions or policy differences are in play.
- Check the MTU/PMTU: ping -M do -s 1472 <target>. Gradually reduce the packet size to determine whether transmission problems are caused by dropped fragments or by the DF (Don't Fragment) bit.
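
The "gradually reduce the packet size" step can be automated with a binary search. The sketch below is illustrative: `find_max_payload`, `path_mtu_from_payload`, and `ping_probe` are hypothetical helper names, the search assumes that any payload smaller than a working one also works, and `ping_probe` assumes Linux iputils ping. The 28 bytes added to the payload are the IPv4 header (20) plus the ICMP header (8), which is why a 1472-byte payload corresponds to a 1500-byte MTU.

```python
import subprocess


def find_max_payload(probe, low=0, high=1472):
    """Binary-search the largest payload size for which probe(size) is True.

    Returns None if even the smallest size fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


def path_mtu_from_payload(payload):
    # IPv4 header (20 bytes) + ICMP header (8 bytes)
    return payload + 28


def ping_probe(target):
    """Build a probe that sends one DF-flagged ping (Linux iputils assumed)."""
    def probe(size):
        cmd = ["ping", "-M", "do", "-s", str(size), "-c", "1", "-W", "1", target]
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        return result.returncode == 0
    return probe
```

A call such as `find_max_payload(ping_probe("<target>"))` would run the search against a real host. Because each probe sends a single packet, ordinary packet loss can make a size look unusable, so repeating the search is advisable on a lossy path.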

3.

Troubleshooting at the routing and BGP levels

- View the BGP routes between the local end and the upstream node. Cisco example: show ip bgp <target prefix> ; Juniper: show route <prefix> protocol bgp.
- Check the AS-Path, MED, and LocalPref, and look for blackhole or other community tags that could cause traffic to be discarded or redirected onto unintended paths.
- Use public-facing Looking Glass tools as well as RIPE/ARIN tools to verify reachability from different autonomous systems. For example, you can use Looking Glass services provided by Hurricane Electric or China Telecom to compare global reachability.
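
As a small aid for the community check above, the sketch below flags well-known communities on a route. The three standard values are the ones defined in RFC 1997 and RFC 7999; the function name and the `extra_blackhole` parameter are hypothetical, and any provider-specific blackhole community has to be taken from the carrier's own documentation.

```python
# Well-known BGP communities in "ASN:value" notation.
WELL_KNOWN = {
    "65535:666": "BLACKHOLE (RFC 7999)",
    "65535:65281": "NO_EXPORT (RFC 1997)",
    "65535:65282": "NO_ADVERTISE (RFC 1997)",
}


def flag_communities(communities, extra_blackhole=()):
    """Return (community, meaning) pairs for communities worth a closer look.

    extra_blackhole: provider-specific blackhole communities (an assumption
    of this sketch; the real values come from the carrier).
    """
    flagged = []
    for c in communities:
        if c in WELL_KNOWN:
            flagged.append((c, WELL_KNOWN[c]))
        elif c in extra_blackhole:
            flagged.append((c, "provider-specific blackhole (assumed)"))
    return flagged
```

For example, `flag_communities(["64500:100", "65535:666"])` returns only the BLACKHOLE entry. The communities themselves would be copied from the router's BGP output for the prefix.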

4.

Rapid checks at the link and interface levels

- View the interface counters on switches and routers: show interface GigabitEthernet0/0 (look at drops, CRC errors, and input errors).
- If physical layer issues are detected (such as CRC errors or frame verification failures), immediately contact the operator of the underlying link or the fiber optic maintenance team, and provide the time window and interface name.
- For MPLS or L2VPN environments, check the status of LSPs/VCs: Use the `show mpls lsp`, `show xconnect`, or `show l2vpn` commands to check if there are any tunnels that are down or if there are any label errors.
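
Interface counters are cumulative, so a single reading says little about whether errors are still occurring. The sketch below compares two readings taken some time apart. The dictionary keys are hypothetical names for values copied by hand from the interface output, not fields of any real API.

```python
def counter_delta(before, after):
    """Compare two snapshots of cumulative interface counters.

    Each snapshot is a dict with the keys 'input_packets', 'crc', and
    'input_errors' (names chosen for this sketch).
    """
    packets = after["input_packets"] - before["input_packets"]
    crc = after["crc"] - before["crc"]
    errors = after["input_errors"] - before["input_errors"]
    # CRC errors per million received packets; 0.0 if no traffic was seen.
    crc_ppm = crc * 1_000_000 / packets if packets > 0 else 0.0
    return {
        "input_packets": packets,
        "crc": crc,
        "input_errors": errors,
        "crc_ppm": crc_ppm,
        "still_erroring": crc > 0 or errors > 0,
    }
```

A non-zero delta over a short window is the useful evidence to pass to the link operator, together with the time window and interface name.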

5.

Packet capture and traffic analysis (tcpdump/Wireshark)

- Run tcpdump on the edge device or target machine: tcpdump -i eth0 host <target IP> and \(tcp or icmp\) -w /tmp/capture.pcap ; Make sure the capture host's clock is accurate, and note whether the timestamps are UTC or local time.
- Pay close attention to: whether the TCP three-way handshake completes, RST and ICMP unreachable messages, ICMP “fragmentation needed” messages (PMTU), duplicate ACKs, and retransmissions.
- Open the pcap file in Wireshark and step through the SYN/ACK exchange in order, noting the delays and retransmission intervals. Take screenshots of the relevant packets for reports to the operator or internal development teams.
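
To make the retransmission check concrete, here is a simplified sketch that works on segments already extracted from a capture. The `Segment` tuple and its fields are assumptions for the example. The rule used, a data segment whose flow and sequence number were already seen, is much cruder than Wireshark's own analysis and ignores SYN retransmissions because they carry no payload.

```python
from collections import namedtuple

# One TCP segment as extracted from a capture (field names are hypothetical).
Segment = namedtuple("Segment", "ts src sport dst dport seq length")


def find_retransmissions(segments):
    """Return (segment, seconds_since_first_copy) for repeated data segments."""
    first_seen = {}
    retransmissions = []
    for s in segments:
        if s.length == 0:
            continue  # pure ACKs carry no data and are skipped
        key = (s.src, s.sport, s.dst, s.dport, s.seq)
        if key in first_seen:
            retransmissions.append((s, s.ts - first_seen[key]))
        else:
            first_seen[key] = s.ts
    return retransmissions
```

The returned intervals correspond to the retransmission intervals mentioned above. Intervals that double on each repeat are typical of timeout-driven retransmission.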

6.

Evidence collection and reporting templates (for carriers and colleagues)

- Essential Information Checklist: Start and end times of the issue (including time zone), affected public and private IP addresses, traceroute results (each hop including IP address and AS), mtr output, tcpdump pcap files, device output from “show interface” and “show bgp”, as well as information on top talkers (using NetFlow/sFlow).
- Recommended reporting format: Timeline → Scope of Impact → Steps to Replicate the Issue → Supporting Evidence (file names and summaries) → Points That Operators Should Check (optical path, forwarding plane, routing policies, firewall policies).
- Make clear requests to the operators, such as: “Please check whether the border routers in AS xxxx are dropping inbound traffic for 1.2.3.0/24, and whether there are RIB/TCAM problems on the BGP neighbors.”
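
The recommended reporting order can be turned into a small template function so every ticket has the same shape. This is only a sketch: `build_report` and its parameters are hypothetical, and the section labels simply follow the order given above.

```python
def build_report(start, end, tz, scope, repro_steps, evidence, checks):
    """Assemble a plain-text fault report in the recommended order.

    evidence: list of (file_name, summary) pairs.
    checks:   list of things the operator is asked to verify.
    """
    lines = [
        "Timeline: {} to {} ({})".format(start, end, tz),
        "Scope of impact: {}".format(scope),
        "Steps to reproduce:",
    ]
    lines += ["  {}. {}".format(i, step)
              for i, step in enumerate(repro_steps, 1)]
    lines.append("Supporting evidence:")
    lines += ["  - {}: {}".format(name, summary) for name, summary in evidence]
    lines.append("Please check:")
    lines += ["  - {}".format(item) for item in checks]
    return "\n".join(lines)
```

Keeping the time zone as a required argument makes it harder to omit from the timeline.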

7.

Common causes and targeted troubleshooting steps

- Routing error/black hole: Check the BGP communities and filtering policies; if a prefix is being filtered incorrectly, restore it and remove the offending community. Example repair steps: remove the filter from the router or modify the route-map, then run a soft reset such as clear ip bgp <neighbor> soft in (or soft out).
- MTU/PMTU problems causing service disruptions: Enable MSS clamping on border devices (e.g., `ip tcp adjust-mss 1360`), or adjust the link MTU and make sure ICMP “fragmentation needed” messages are not being discarded along the path.
- Physical/optical path issues: On detecting interface CRC errors or jitter, request an OTDR test on the optical cable, or ask the carrier to replace the optical modules or re-terminate the link.
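
The MSS value used for clamping follows from simple arithmetic, sketched below for IPv4 without IP or TCP options. An MSS of 1360 corresponds to a 1400-byte path MTU. The helper name is hypothetical, and the GRE figure in the usage note is just one common example of tunnel overhead.

```python
def mss_for_mtu(mtu, ip_header=20, tcp_header=20):
    """TCP MSS that fits in one packet of the given MTU (IPv4, no options)."""
    return mtu - ip_header - tcp_header


# Standard Ethernet: 1500 - 20 - 20
print(mss_for_mtu(1500))       # 1460
# The clamp value from the text implies a 1400-byte path MTU
print(mss_for_mtu(1400))       # 1360
# GRE over IPv4 adds 24 bytes (20 outer IP + 4 GRE) to each packet
print(mss_for_mtu(1500 - 24))  # 1436
```

Choosing a clamp value a little below the computed MSS leaves headroom for overhead that was not accounted for.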

8.

Temporary reroutes and strategies to mitigate the impact

- Use BGP communities to steer traffic through alternative CN2 POPs or other routes back to China: set a higher localpref or apply AS-path prepending for specific prefixes, or negotiate community policies with the carriers.
- Gradual (canary) cutover: Apply DNS load balancing to critical services, or use Anycast or multiple egress points to distribute traffic.
- For short-term, high-impact failures, consider enabling traffic throttling or QoS on critical devices to protect the control plane and ensure that vital services receive priority.
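
The effect of localpref and prepending can be illustrated with a toy route comparison. This sketch models only two steps of BGP best-path selection (highest local preference, then shortest AS path) and ignores weight, origin, MED, and all later tie-breakers. The `Route` type, POP names, and AS numbers are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    via: str          # hypothetical label for the exit or POP
    local_pref: int
    as_path: tuple    # AS numbers, nearest first


def best_route(routes):
    """Prefer higher local_pref, then the shorter AS path (simplified)."""
    return max(routes, key=lambda r: (r.local_pref, -len(r.as_path)))


def prepend(route, asn, times):
    """Return a copy of the route with asn prepended `times` times."""
    return Route(route.via, route.local_pref, (asn,) * times + route.as_path)


pop_a = Route("pop-a", 100, (64500, 64501))
pop_b = Route("pop-b", 100, (64500, 64502, 64503))
print(best_route([pop_a, pop_b]).via)                        # pop-a
print(best_route([prepend(pop_a, 64500, 2), pop_b]).via)     # pop-b
```

In practice localpref influences which exit the local network chooses, while prepending influences how other networks choose a path toward the prefix, so the two are applied on different sides of a session.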

9.

Suggestions for long-term prevention and monitoring improvements

- Deploy multi-point active monitoring (such as RIPE Atlas or custom probes) to continuously watch the US CN2 paths. Set up alerts on RTT and packet loss, and enable automatic pcap capture.
- Create fault ticket templates and automated scripts: when the packet loss rate at a given hop exceeds a threshold, sampling and reporting scripts are triggered automatically, reducing the time spent on manual intervention.
- Regularly align BGP configurations, community policies, and operational contact channels with the carrier, and schedule monthly/quarterly coordination meetings.
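
A threshold trigger of the kind described above usually needs some damping so that one bad sample does not start a capture. The sketch below fires once after a run of consecutive breaches. The class name, the 5% threshold, and the three-sample window are assumptions for illustration, and the action taken when it fires is left to the caller.

```python
class LossAlarm:
    """Fire once when loss stays at or above a threshold for N samples."""

    def __init__(self, threshold_pct=5.0, consecutive=3):
        self.threshold_pct = threshold_pct
        self.consecutive = consecutive
        self.streak = 0

    def observe(self, loss_pct):
        """Record one sample; return True only on the sample that trips."""
        if loss_pct >= self.threshold_pct:
            self.streak += 1
        else:
            self.streak = 0
        return self.streak == self.consecutive
```

A monitoring loop would call `observe` with each new loss measurement and start the evidence collection scripts when it returns True. The alarm re-arms after any sample below the threshold.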

10.

Case review: end-to-end handling of a typical US CN2 outage

- Review steps: 1) An alarm was received. 2) mtr/traceroute placed the fault in the CN2 backbone. 3) Packet captures at the backbone ingress showed a large number of TCP retransmissions and ICMP “unreachable” messages. 4) A check of the upstream BGP state found that some prefixes had been tagged with a blackhole community. 5) The evidence was sent to the ISP. 6) After the ISP corrected the routing configuration, verification confirmed that the issue was resolved.
- Lessons learned and areas for improvement: Add automated evidence collection scripts, and configure multiple egress points and flexible community policies for key prefixes to reduce the impact of single points of failure.

11.

Question: If traceroute shows heavy packet loss at a particular hop, does that mean the device is faulty?

- Answer: Not necessarily. Loss shown at one hop may simply mean that the device gives ICMP/TCP probe responses a lower priority or rate-limits them. The key is whether subsequent hops are also affected. Use the loss and delay trend reported by mtr, together with tcpdump data from the server side (checking for excessive retransmissions or brief unavailability), to decide whether packets are really being dropped or only the responses are missing.
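
The rule of thumb in this answer can be written as a short check over per-hop loss figures taken from an mtr report. This is a heuristic sketch, not a diagnosis: the function name and the 1% threshold are assumptions, and the verdicts still need confirming against server-side captures.

```python
def classify_hops(loss_by_hop, threshold=1.0):
    """Label each hop given loss percentages in path order.

    Loss at a hop is treated as real only if every later hop, including
    the destination, also shows loss at or above the threshold.
    """
    verdicts = []
    for i, loss in enumerate(loss_by_hop):
        if loss < threshold:
            verdicts.append("ok")
        elif all(later >= threshold for later in loss_by_hop[i + 1:]):
            verdicts.append("likely real loss")
        else:
            verdicts.append("likely response rate limiting")
    return verdicts
```

For a path with loss values `[0, 40, 0, 0]`, the second hop is labeled as likely rate limiting because the hops behind it are clean. For `[0, 0, 12, 11, 13]`, the loss carries through to the destination and is labeled as likely real from the third hop onward.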

12.

Question: What are some of the most easily overlooked but crucial pieces of information when reporting to a carrier?

- Answer: Commonly overlooked items include the exact start and end times of the outage (including time zone), device interface counters (CRC/err), the original mtr output files covering the entire incident, tcpdump pcap files (with timestamps), and sample 5-tuples for the affected traffic (source and destination IP address, source and destination port, protocol). These details can significantly reduce the time it takes for operators to locate the issue.

13.

Question: How can we minimize the impact on business operations in the short term while protecting the user experience?

- Answer: Possible measures include temporarily adjusting BGP policies to direct traffic over alternate domestic links, enabling multi-egress DNS resolution or Anycast at the application layer, caching popular content on a CDN, and tuning retry and timeout behavior on the client side. Together these can significantly reduce the impact on users until the underlying infrastructure problem is fully resolved.
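
For the client-side retry tuning mentioned in this answer, one widely used pattern is exponential backoff with random jitter, so that many clients do not retry in lockstep during an outage. The sketch below is an illustration of that pattern, not something prescribed by the article. The function name and the default base and cap values are assumptions.

```python
import random


def backoff_delays(attempts, base=0.5, cap=8.0, rng=None):
    """Return one randomized wait time (seconds) per retry attempt.

    Attempt n waits a random time between 0 and min(cap, base * 2**n).
    """
    rng = rng or random.Random()
    delays = []
    for n in range(attempts):
        ceiling = min(cap, base * (2 ** n))
        delays.append(rng.uniform(0, ceiling))
    return delays
```

A client would sleep for each delay in turn between retries and give up after the last one. The cap keeps the worst-case wait bounded during a long outage.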
